Machine translation: statistical approach with additional linguistic knowledge

Authors

  • Maja Popović
  • Hermann Ney
  • Nicola Ueffing
  • Franz Josef Och
Abstract

In this thesis, three possible aspects of using linguistic (i.e. morpho-syntactic) knowledge for statistical machine translation are described: the treatment of syntactic differences between source and target language using source POS tags, statistical machine translation with a small amount of bilingual training data, and automatic error analysis of translation output. Reorderings in the source language based on the POS tags are systematically investigated: local reorderings of nouns and adjectives for the Spanish–English language pair and long-range reorderings of verbs for the German–English language pair. Both types of reordering improve the performance of the translation system, with local reordering being more important for scarce training corpora. For such corpora, strategies for achieving an acceptable translation quality by applying appropriate morpho-syntactic transformations are explored for three language pairs: Spanish–English, German–English and Serbian–English. Very scarce task-specific corpora as well as conventional dictionaries are used as bilingual training material. In addition to conventional dictionaries, the use of phrasal lexica is proposed and investigated. A framework is presented for automatic analysis and classification of actual errors in translation output, based on combining existing automatic evaluation measures with linguistic information. Experiments on different types of corpora and various language pairs show that the results of automatic error analysis correlate very well with the results of human evaluation. The new metrics based on the analysed error categories are used to compare different translation systems trained on texts of various sizes, with and without morpho-syntactic transformations.
When morpho-syntactic information is used to improve the quality of a statistical machine translation system, the choice of method and the significance of the improvements depend strongly on the language pair, the translation direction and the nature of the corpus. Error analysis of the translation output is important for identifying the weak points of the system and for applying improvement methods in the most effective way.
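
The local noun-adjective reordering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual rules used in the thesis, so everything here is an assumption made for illustration: the function name `reorder_noun_adjective`, the tag names `NOUN` and `ADJ`, and the single rule "move adjectives that directly follow a noun in front of it" so that Spanish source word order moves closer to English.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, POS tag); the tag names are hypothetical


def reorder_noun_adjective(sentence: List[Tagged]) -> List[Tagged]:
    """Move each run of adjectives that directly follows a noun in front of it.

    Illustrative rule only; not the rule set used in the thesis.
    """
    out: List[Tagged] = []
    i = 0
    while i < len(sentence):
        if sentence[i][1] == "NOUN":
            # Find the end of the adjective run that follows the noun.
            j = i + 1
            while j < len(sentence) and sentence[j][1] == "ADJ":
                j += 1
            out.extend(sentence[i + 1:j])  # adjectives first
            out.append(sentence[i])        # then the noun
            i = j
        else:
            out.append(sentence[i])
            i += 1
    return out


# Spanish source sentence with hypothetical POS tags.
example = [("la", "DET"), ("casa", "NOUN"), ("blanca", "ADJ"),
           ("es", "VERB"), ("grande", "ADJ")]
print(" ".join(word for word, _ in reorder_noun_adjective(example)))
```

In a setup like this, the reordering would be applied to the source side of both the training and the test data before the translation model is trained, so that the model sees a word order closer to that of the target language. The adjective after the verb is left in place because it does not directly follow a noun.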

Similar resources

Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning

Linguistic resources such as part-of-speech (POS) tags have been extensively used in statistical machine translation (SMT) frameworks and have yielded better performance. However, the use of such linguistic annotations in neural machine translation (NMT) systems has been left under-explored. In this work, we show that multi-task learning is a successful and an easy approach to introduce an additio...

Using Linguistic Knowledge in Statistical Machine Translation

In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using lingui...

Statistical machine translation from Slovenian to English

In this paper, we analyse three statistical models for the machine translation of Slovenian into English. All of them are based on the IBM Model 4, but differ in the type of linguistic knowledge they use. Model 4a uses only basic linguistic units of the text, i.e., words and sentences. In Model 4b, lemmatisation is used as a preprocessing step of the translation task. Lemmatisation also makes i...

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

Linguistically Annotated BTG for Statistical Machine Translation

Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this paper, we propose a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated da...

Toponym Disambiguation in English-Lithuanian SMT System with Spatial Knowledge

This paper presents research resulting in an English-Lithuanian statistical factored phrase-based machine translation system with a spatial ontology. The system is based on the Moses toolkit and is enriched with semantic knowledge inferred from the spatial ontology. The ontology was developed on the basis of the GeoNames database (more than 15,000 toponyms), implemented in the we...

Publication date: 2009